[NOISE] This lecture is about the ordinal
logistic regression for
sentiment analysis.
So, this is our problem set up for a
typical sentiment classification problem.
Or more specifically a rating prediction.
We have an opinionated text document d as
input, and we want to generate as output,
a rating in the range of 1 through k so
it's a discrete rating, and
this is a categorization problem.
We have k categories here.
Now we could use a regular text for
categorization technique
to solve this problem.
But such a solution would not consider the
order and dependency of the categories.
Intuitively, the features that can
distinguish category 2 from 1,
or rather rating 2 from 1,
may be similar to
those that can distinguish k from k-1.
For example, positive words
generally suggest a higher rating.
When we train categorization
problem by treating these categories as
independent we would not capture this.
So what's the solution?
Well in general we can order to classify
and there are many different approaches.
And here we're going to
talk about one of them that
called ordinal logistic regression.
Now, let's first think about how
we use logistical regression for
a binary sentiment.
A categorization problem.
So suppose we just wanted to distinguish
a positive from a negative and
that is just a two category
categorization problem.
So the predictors are represented as X and
these are the features.
And there are M features all together.
The feature value is a real number.
And this can be representation
of a text document.
And why it has two values,
binary response variable 0 or 1.
1 means X is positive,
0 means X is negative.
And then of course this is a standard
two category categorization problem.
We can apply logistical regression.
You may recall that in logistical
regression, we assume the log
of probability that the Y is equal to one,
is
assumed to be a linear function
of these features, as shown here.
So this would allow us to also write
the probability of Y equals one, given X
in this equation that you
are seeing on the bottom.
So that's a logistical function and
you can see it relates
this probability to,
probability that y=1
to the feature values.
And of course beta i's
are parameters here, so this is
just a direct application of logistical
regression for binary categorization.
What if we have multiple categories,
multiple levels?
Well we have to use such a binary
logistical regression problem
to solve this multi
level rating prediction.
And the idea is we can introduce
multiple binary class files.
In each case we asked
the class file to predict the,
whether the rating is j or above,
or the rating's lower than j.
So when Yj is equal to 1,
it means rating is j or above.
When it's 0,
that means the rating is Lower than j.
So basically if we want to predict
a rating in the range of 1-k,
we first have one classifier to
distinguish a k versus others.
And that's our classifier one.
And then we're going to have another
classifier to distinguish it.
At k-1 from the rest.
That's Classifier 2.
And in the end, we need a Classifier
to distinguish between 2 and 1.
So altogether we'll have k-1 classifiers.
Now if we do that of course then
we can also solve this problem
and the logistical regression program
will be also very straight forward
as you have just seen
on the previous slide.
Only that here we have more parameters.
Because for each classifier,
we need a different set of parameters.
So now the logistical regression
classifies index by J,
which corresponds to a rating level.
And I have also used of
J to replace beta 0.
And this is to.
Make the notation more consistent,
than was what we can show in
the ordinal logistical regression.
So here we now have basically k minus one
regular logistic regression classifiers.
Each has it's own set of parameters.
So now with this approach,
we can now do ratings as follows.
After we have trained these k-1
logistic regression classifiers,
separately of course,
then we can take a new instance and
then invoke a classifier
sequentially to make the decision.
So first let look at the classifier
that corresponds to level of rating K.
So this classifier will tell
us whether this object should
have a rating of K or about.
If probability according to this
logistical regression classifier is
larger than point five,
we're going to say yes.
The rating is K.
Now, what if it's not as
large as twenty-five?
Well, that means the rating's below K,
right?
So now,
we need to invoke the next classifier,
which tells us whether
it's above K minus one.
It's at least K minus one.
And if the probability is
larger than twenty-five,
then we'll say, well, then it's k-1.
What if it says no?
Well, that means the rating
would be even below k-1.
And so we're going to just keep
invoking these classifiers.
And here we hit the end when we need
to decide whether it's two or one.
So this would help us solve the problem.
Right?
So we can have a classifier that would
actually give us a prediction of a rating
in the range of 1 through k.
Now unfortunately such a strategy is not
an optimal way of solving this problem.
And specifically there are two
problems with this approach.
So these equations are the same as.
You have seen before.
Now the first problem is that there
are just too many parameters.
There are many parameters.
Now, can you count how many
parameters do we have exactly here?
Now this may be a interesting exercise.
To do.
So
you might want to just pause the video and
try to figure out the solution.
How many parameters do I have for
each classifier?
And how many classifiers do we have?
Well you can see the, and so
it is that for each classifier we have
n plus one parameters, and we have k
minus one classifiers all together,
so the total number of parameters is
k minus one multiplied by n plus one.
That's a lot.
A lot of parameters, so when
the classifier has a lot of parameters,
we would in general need a lot of data
out to actually help us, training data,
to help us decide the optimal
parameters of such a complex model.
So that's not ideal.
Now the second problems
is that these problems,
these k minus 1 plus fives,
are not really independent.
These problems are actually dependent.
In general, words that are positive
would make the rating higher
for any of these classifiers.
For all these classifiers.
So we should be able to take
advantage of this fact.
Now the idea of ordinal logistical
regression is precisely that.
The key idea is just
the improvement over the k-1
independent logistical
regression classifiers.
And that idea is to tie
these beta parameters.
And that means we are going to
assume the beta parameters.
These are the parameters that indicated
the inference of those weights.
And we're going to assume these
beta values are the same for
all the K- 1 parameters.
And this just encodes our intuition that,
positive words in general would
make a higher rating more likely.
So this is intuitively assumptions,
so reasonable for our problem setup.
And we have this order
in these categories.
Now in fact, this would allow us
to have two positive benefits.
One is it's going to reduce
the number of families significantly.
And the other is to allow us
to share the training data.
Because all these parameters
are similar to be equal.
So these training data, for
different classifiers can then be
shared to help us set
the optimal value for beta.
So we have more data to help
us choose a good beta value.
So what's the consequence,
well the formula would look very similar
to what you have seen before only that,
now the beta parameter has just one
index that corresponds to the feature.
It no longer has the other index that
corresponds to the level of rating.
So that means we tie them together.
And there's only one set of better
values for all the classifiers.
However, each classifier still
has the distinct R for value.
The R for parameter.
Except it's different.
And this is of course needed to predict
the different levels of ratings.
So R for sub j is different it
depends on j, different than j,
has a different R value.
But the rest of the parameters,
the beta i's are the same.
So now you can also ask the question,
how many parameters do we have now?
Again, that's an interesting
question to think about.
So if you think about it for a moment, and
you will see now, the param,
we have far fewer parameters.
Specifically we have M plus K minus one.
Because we have M, beta values, and
plus K minus one of our values.
So let's just look basically,
that's basically the main idea of
ordinal logistical regression.
So, now, let's see how we can use such
a method to actually assign ratings.
It turns out that with this, this idea of
tying all the parameters, the beta values.
We also end up by having
a similar way to make decisions.
And more specifically now, the criteria
whether the predictor probabilities
are at least 0.5 above,
and now is equivalent to
whether the score of
the object is larger than or
equal to negative authors of j,
as shown here.
Now, the scoring function is just
taking the linear combination of
all the features with
the divided beta values.
So, this means now we can simply make
a decision of rating, by looking at
the value of this scoring function,
and see which bracket it falls into.
Now you can see the general
decision rule is thus,
when the score is in the particular
range of all of our values,
then we will assign the corresponding
rating to that text object.
So in this approach,
we're going to score the object
by using the features and
trained parameter values.
This score will then be
compared with a set of trained
alpha values to see which
range the score is in.
And then,
using the range, we can then decide which
rating the object should be getting.
Because, these ranges of alpha
values correspond to the different
levels of ratings, and that's from
the way we train these alpha values.
Each is tied to some level of rating.
[MUSIC]

